Ignore remote ES|QL execution failures when skip_unavailable=true #116365
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/ComputeListener.java
72a0519 to a9d0b09
Hi @smalyshev, I've created a changelog YAML for you.
d7d55f5 to 2842278
quux00
left a comment
First pass review with some questions and comments.
.../esql/compute/src/main/java/org/elasticsearch/compute/operator/exchange/ExchangeService.java
    /**
     * Marks the cluster as PARTIAL and adds the exception to the cluster's failures record.
     * Currently, additional failures are not recorded, TODO: check if this should be the case.
It's OK to add more failures to the cluster metadata list. In search when shard level searches occur, a cluster might have multiple failures listed in the array, so feel free to do that here if there is a use case for it.
OK I'll do it later maybe, for now I think I want to deal with a single failure first without the complication of handling more than one.
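To make the trade-off in this thread concrete, here is a minimal sketch of a per-cluster record that either keeps only the first failure (the behavior described in the Javadoc above) or appends every failure (the option the reviewer says is acceptable). `ClusterRecordSketch`, its `keepAllFailures` flag, and its `Status` enum are hypothetical names used for illustration only; they are not the real `EsqlExecutionInfo.Cluster` API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for per-cluster execution metadata (assumption, not the real class).
final class ClusterRecordSketch {
    enum Status { RUNNING, PARTIAL, SKIPPED }

    private Status status = Status.RUNNING;
    private final List<Exception> failures = new ArrayList<>();
    private final boolean keepAllFailures;

    ClusterRecordSketch(boolean keepAllFailures) {
        this.keepAllFailures = keepAllFailures;
    }

    // Marks the cluster as PARTIAL. The failure is recorded if it is the first one,
    // or unconditionally when keepAllFailures is true.
    synchronized void markAsPartial(Exception e) {
        status = Status.PARTIAL;
        if (keepAllFailures || failures.isEmpty()) {
            failures.add(e);
        }
    }

    synchronized Status status() {
        return status;
    }

    // Returns an immutable snapshot of the recorded failures.
    synchronized List<Exception> failures() {
        return List.copyOf(failures);
    }
}
```

With `keepAllFailures` set to `false`, a second call to `markAsPartial` leaves the failure list unchanged, which matches the single-failure approach chosen for now; switching the flag is the only change needed to record more.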
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicBoolean;

    import static org.elasticsearch.xpack.esql.session.EsqlSessionCCSUtils.markClusterEmptyInfo;
My thinking about the EsqlSessionCCSUtils is that it would be specific to plan-time operations and not used at execution time (that's why I renamed it from Costin's CcsUtils to EsqlSessionCCSUtils). So I would want us to think about if there are enough "helper" methods needed at execution time to create a "ComputeCCSUtils"? If not, then we should rename EsqlSessionCCSUtils to maybe just "EsqlCCSUtils".
Right now there's one method only - one that creates that "empty" cluster state (which is yes, not empty, see below) but if there's more then we could move it. I don't want to create another utils just for one method though.
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/ComputeListener.java
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/ComputeService.java
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSessionCCSUtils.java
Hi @smalyshev, I've updated the changelog YAML for you.
dnhatn
left a comment
I've reviewed the ComputeService and ComputeListener. I think we're getting closer. Thanks for your iterations on this @smalyshev.
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/ComputeService.java
    };
    // Cancel the group on sink failure
    ActionListener<Void> exchangeListener = computeListener.acquireAvoid().delegateResponse((inner, e) -> {
        taskManager.cancelTaskAndDescendants(groupTask, "exchange sink failure", true, ActionListener.noop());
I think we should wait for the cancellation here. I think we should also fix the same issue in ComputeListener. Can you move this into the cancellation listener?
    if (suppressRemoteFailure) {
        computeListener.markAsPartial(clusterAlias, e);
        taskManager.cancelTaskAndDescendants(groupTask, "exchange sink failure", true, ActionListener.running(() -> inner.onResponse(null)));
    } else {
        inner.onFailure(e);
    }
Wait, wouldn't that code not cancel task on failure?
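For readers following this thread, the shape of the suggestion can be sketched without any Elasticsearch classes. The sketch below assumes two hypothetical interfaces, `Listener` (standing in for `ActionListener<Void>`) and `Canceller` (standing in for the `taskManager.cancelTaskAndDescendants` call); neither is a real API. It shows that in the suppressed branch the inner listener completes only after cancellation has finished, and that the other branch only forwards the failure, so any cancellation there depends on what the receiving listener does, which is the point the question above raises.

```java
final class SinkFailureSketch {
    // Hypothetical minimal listener (assumption); the real code uses ActionListener<Void>.
    interface Listener {
        void onResponse();
        void onFailure(Exception e);
    }

    // Hypothetical canceller (assumption): starts cancellation, then runs onDone when it has finished.
    interface Canceller {
        void cancel(String reason, Runnable onDone);
    }

    static void onSinkFailure(
        boolean suppressRemoteFailure,
        Exception e,
        Runnable markAsPartial,
        Canceller canceller,
        Listener inner
    ) {
        if (suppressRemoteFailure) {
            // Record the failure against the cluster, then cancel and wait:
            // inner is completed successfully only from the cancellation callback.
            markAsPartial.run();
            canceller.cancel("exchange sink failure", inner::onResponse);
        } else {
            // No cancellation is started here; the failure is only propagated.
            inner.onFailure(e);
        }
    }
}
```

Passing a `Canceller` that defers its `onDone` callback makes the ordering visible: `inner.onResponse()` is not invoked until that callback runs.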
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/ComputeService.java
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSessionCCSUtils.java
        assertClusterStatusAndShardCounts(remote2Cluster, EsqlExecutionInfo.Cluster.Status.SKIPPED);
    }

    // skip_unavailable=true clusters are unavailable, both marked as PARTIAL
I don't think this test belongs here.
This is in the test for testUpdateExecutionInfoWithUnavailableClusters, which only ever sets the status to SKIPPED, not PARTIAL.
This test should go in a different method (not sure which, though).
...l/src/internalClusterTest/java/org/elasticsearch/xpack/esql/action/CrossClustersQueryIT.java
b0ee146 to 1f1601b
Superseded by #121240
Catch the error coming from the remote cluster and mark the cluster as PARTIAL. Looks like we need to do it in two places, since otherwise acquireAvoid will take over and cancel the whole task, which we don't want to happen.
At runtime, the following failures can happen, which need to be covered:
See also: #112886